Part 1 - Crawling Wikipedia

Introduction

We're going to create a social network of characters in the Marvel Cinematic Universe. You are looking at a Jupyter notebook. Each section is a cell that can contain text or Python code. You can run a cell by selecting it and pressing Ctrl-Enter. You will see the results of your code as it runs. Try running the cell below.


In [15]:
import wikinetworking as wn
import networkx as nx
from pyquery import PyQuery
%matplotlib inline

print("OK")


OK

You just ran some Python code that imports packages. Packages are pre-written Python code. The wikinetworking package contains code for crawling, text mining, and graphing wiki articles. You can access these functions through the wn object, the alias given in the import statement above.
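
If you are curious about what the package provides, Python's built-in dir() function lists the names defined in the wn module (the exact output depends on the installed version of wikinetworking):


In [ ]:
# Show the public functions available through the wn alias.
# Names that start with an underscore are internal helpers.
print([name for name in dir(wn) if not name.startswith("_")])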

Our first step is getting a list of links that we want to crawl. Wikipedia organizes many of its articles into list pages by topic. (There are many such lists, and not all of them are easy to find, so search for one that works for your topic.) Once you find the URL of a list page that contains the articles you would like to crawl, paste it into the variable below.


In [ ]:
url = "https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_film_actors"
print(url)

Now we can download the article and get a list of links from it.


In [ ]:
# Download the list page and pull out the links it contains
links = wn.filter_links(PyQuery(url=url))
print(links)

Many of these links may not be relevant to our topic. We can filter for links that appear inside certain kinds of HTML elements; you can find out which element to target by inspecting a relevant link on your Wikipedia page in your browser. We then describe that element with a CSS selector, a special kind of filter that keeps only the links inside matching elements.


In [ ]:
selector = "th"

links = wn.filter_links(PyQuery(url=url), selector=selector)
print(links)
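
To see concretely what a selector does, here is a minimal sketch using a made-up HTML snippet (not the real Wikipedia page): the selector "th a" matches only anchors nested inside table header cells, so the link in the ordinary table cell is dropped.


In [ ]:
# Minimal illustration of a CSS selector, using a hypothetical HTML snippet.
sample_html = """
<table>
  <tr><th><a href="/wiki/Iron_Man">Iron Man</a></th></tr>
  <tr><td><a href="/wiki/Unrelated_page">Unrelated page</a></td></tr>
</table>
"""

doc = PyQuery(sample_html)
# "th a" selects anchors inside <th> elements only.
hrefs = [PyQuery(anchor).attr("href") for anchor in doc("th a")]
print(hrefs)  # ['/wiki/Iron_Man']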

Another way of filtering links is to cross-reference them with a second list of links. First, find another URL that shares the links you want but excludes the links you don't want from the first list.


In [ ]:
# Fill in the URL of a second list page and a CSS selector for its links
another_url = ""
another_selector = ""
more_links = wn.filter_links(PyQuery(url=another_url), selector=another_selector)
print(more_links)

What if you need links from a list of lists? You can automatically crawl a list of URLs as well. First, we need to generate that list of URLs.


In [ ]:
url_pattern = "https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_"

# One list page per letter of the alphabet, plus one for numeric names
sections = [letter for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
sections.append('0-9')

many_urls = [url_pattern + section for section in sections]
print(many_urls)

And then we can crawl this list of URLs.


In [ ]:
selector = ".hatnote"

more_links = wn.retrieve_multipage(many_urls, selector=selector, verbose=True)

Now that we have a second set of links, we can look for the intersection of the two lists. That should give us only the URLs we want.
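
Conceptually, an intersection keeps only the items that appear in both lists. Here is a toy example with made-up link lists; wn.intersection presumably applies the same idea to our real data.


In [ ]:
# Toy example (hypothetical data): only links present in BOTH lists survive.
from_actor_list = ["/wiki/Iron_Man", "/wiki/Pepper_Potts", "/wiki/Stan_Lee"]
from_character_lists = ["/wiki/Iron_Man", "/wiki/Pepper_Potts", "/wiki/Thanos"]

common = [link for link in from_actor_list if link in from_character_lists]
print(common)  # ['/wiki/Iron_Man', '/wiki/Pepper_Potts']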


In [ ]:
relevant_links = wn.intersection(links, more_links)
print(relevant_links)

Let's save these links into a file so we don't have to download the data again.


In [ ]:
wn.write_list(relevant_links, "relevant_links.txt")

Let's also make sure we can load the data after we've saved it.


In [ ]:
relevant_links = wn.read_list("relevant_links.txt")
print(relevant_links)
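
As a rough idea of what these helpers do, here is a plain-Python sketch of saving and re-loading a list of links, assuming a simple one-link-per-line file format (the actual wn.write_list and wn.read_list may differ in their details).


In [ ]:
# Plain-Python sketch of a save/load round trip, assuming one link per line.
def write_list_sketch(items, filename):
    with open(filename, "w") as f:
        for item in items:
            f.write(item + "\n")

def read_list_sketch(filename):
    with open(filename) as f:
        return [line.strip() for line in f if line.strip()]

write_list_sketch(["/wiki/Iron_Man", "/wiki/Thor"], "example_links.txt")
print(read_list_sketch("example_links.txt"))  # ['/wiki/Iron_Man', '/wiki/Thor']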

Now that we have the relevant_links list, we just need to choose a starting article.


In [ ]:
starting_url = "/wiki/Iron_Man"

# Crawl outward from the starting article, following only links in relevant_links
raw_crawl_data = wn.crawl(starting_url, accept=relevant_links)

import json
print(json.dumps(raw_crawl_data, sort_keys=True, indent=4))

We can "flatten" the crawl data into an undirected graph and save it for convenience.
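
The exact structure of raw_crawl_data depends on wn.crawl, but the idea behind "flattening" can be shown with a hypothetical example: if the crawl recorded which articles link to which, an undirected graph treats a link in either direction as a single connection between the two articles.


In [ ]:
# Conceptual sketch with hypothetical data; the real raw_crawl_data comes
# from wn.crawl and may be structured differently.
hypothetical_crawl = {
    "Iron Man": ["Pepper Potts", "War Machine"],
    "Pepper Potts": ["Iron Man"],
}

# Collapse directed links into undirected edges (each pair counted once).
edges = set()
for article, neighbors in hypothetical_crawl.items():
    for neighbor in neighbors:
        edges.add(tuple(sorted((article, neighbor))))

print(sorted(edges))
# [('Iron Man', 'Pepper Potts'), ('Iron Man', 'War Machine')]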


In [ ]:
graph_data = wn.undirected_graph(raw_crawl_data)

import json
print(json.dumps(graph_data, sort_keys=True, indent=4))

wn.save_dict(graph_data, "undirected_graph.json")

Next...

Now we can draw the graph. Go to Part 2 - Drawing the Network...